Python Web Scraping, Second Edition by Packt Publishing
Author: Packt Publishing
Language: eng
Format: mobi
Publisher: Packt Publishing
Published: 2017-05-29T09:14:54+00:00
def process_queue():
    while len(crawl_queue):
        url = crawl_queue.pop()
        ...
The first change is replacing our Python list with the new Redis-based queue, named RedisQueue. Because this queue handles duplicate URLs internally, the seen variable is no longer required. Finally, len is called on the RedisQueue in the loop condition to determine whether there are still URLs in the queue. The further logic changes that handle depth, and replace the old seen functionality, are shown here:
# inside process_queue
if no_robots or rp.can_fetch(user_agent, url):
    depth = crawl_queue.get_depth(url) or 0
    if depth == max_depth:
        print('Skipping %s due to depth' % url)
        continue
    html = D(url, num_retries=num_retries)
    if not html:
        continue
    if scraper_callback:
        links = scraper_callback(url, html) or []
    else:
        links = []
    # filter for links matching our regular expression
    for link in get_links(html, link_regex) + links:
        if 'http' not in link:
            link = clean_link(url, domain, link)
        crawl_queue.push(link)
        crawl_queue.set_depth(link, depth + 1)
The full code can be seen at http://github.com/kjam/wswp/blob/master/code/chp4/threaded_crawler_with_queue.py.
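To make the queue interface used in the snippet above more concrete, here is a minimal in-memory sketch that exposes the same calls (push, pop, len, get_depth, set_depth). It is not the book's RedisQueue and does not talk to Redis. The class name InMemoryQueue, the crawl helper, and the fake link graph are all hypothetical names introduced only for illustration. The sketch assumes that push silently ignores URLs that were queued before, which is how the text describes the duplicate handling.

```python
class InMemoryQueue:
    """Hypothetical in-memory stand-in for the queue interface used above.

    The real RedisQueue keeps this state in Redis so that several
    processes can share it; this sketch only mirrors the method names.
    """

    def __init__(self):
        self._pending = []   # URLs waiting to be crawled
        self._seen = set()   # every URL ever pushed, used for de-duplication
        self._depth = {}     # URL -> recorded crawl depth

    def __len__(self):
        return len(self._pending)

    def push(self, url):
        """Queue url unless it has already been queued once."""
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def pop(self):
        return self._pending.pop()

    def get_depth(self, url):
        # Returns None for unknown URLs, hence the `or 0` in the caller.
        return self._depth.get(url)

    def set_depth(self, url, depth):
        self._depth[url] = depth


def crawl(start_url, link_graph, max_depth):
    """Walk a fake link graph using the same depth check as the snippet above.

    link_graph maps a URL to the links "found" on that page, standing in
    for the download and link-extraction steps.
    """
    queue = InMemoryQueue()
    queue.push(start_url)
    visited = []
    while len(queue):
        url = queue.pop()
        depth = queue.get_depth(url) or 0
        if depth == max_depth:
            continue  # too deep, skip without "downloading"
        visited.append(url)
        for link in link_graph.get(url, []):
            queue.push(link)                 # duplicates are ignored here
            queue.set_depth(link, depth + 1)
    return visited


if __name__ == '__main__':
    graph = {'/': ['/a', '/b'], '/a': ['/c', '/'], '/c': ['/d']}
    print(crawl('/', graph, max_depth=2))
```

Because the queue itself rejects repeated URLs, the crawl loop needs no separate seen set; the link from /a back to / is simply dropped by push.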
This updated version of the threaded crawler can then be started using multiple processes with this snippet:
import multiprocessing